IN THE CLAIMS:



Please substitute the following claims for the same-numbered claims in the application:



1. (Original) A method for determining a degree of similarity between documents in a given
document collection, the method comprising the steps of:

modeling all said documents as labeled tree representations;

building a computerized dictionary of path representations relating to paths that occur in
said documents;

storing, for at least two said documents, said labeled tree representations of respective
documents;

storing, for said at least two said documents, said path representations relating to said
paths that occur in the said documents from root nodes to leaf nodes in the said labeled tree
representations of said respective documents; and

representing each of said documents in said document collection as an N-dimensional
vector comprising an element i denoting a value of a feature associated with a particular path,
wherein said feature comprises any of a presence or absence of said particular path in said
documents and a frequency of occurrence of said particular path in said documents;

calculating a measure of similarity between two of the documents based upon the
frequency of occurrence of similar paths specified by the path representations; and

using said measure of similarity to cluster a plurality of documents comprising similar
information, wherein said documents comprise any of web page documents and Extensible
Markup Language (XML) documents,

wherein two documents that differ only in the frequency of occurrence of the paths
associated with said two documents are considered to be more similar to each other than two
documents that differ in the occurrence of paths.



2. (Currently Amended) The method as claimed in claim 1, wherein the tree representation
is a Document [[Model]] Object Model representation.



3. (Original) The method as claimed in claim 1, further comprising the step of generating a
path representation for a path of a document as a sequence of labels representative from a root
node to a leaf node in the labeled tree representation of the document.



4. (Original) The method as claimed in claim 1, further comprising the step of storing, as
path representations, sets of sequenced labels representative of distinct paths in a labeled tree 
representation of a corresponding document. 
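
As an illustration of claims 3 and 4, the following is a minimal sketch of how root-to-leaf label paths might be collected from an XML document. The function name `root_to_leaf_paths`, the `/` separator, and the use of Python's `xml.etree.ElementTree` are assumptions made for this example and are not specified in the claims.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def root_to_leaf_paths(xml_text, sep="/"):
    """Count each root-to-leaf sequence of labels in an XML document.

    The separator and the Counter return type are illustrative choices.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()

    def walk(node, prefix):
        path = prefix + [node.tag]
        children = list(node)
        if not children:
            # A leaf node closes one complete root-to-leaf path.
            counts[sep.join(path)] += 1
        for child in children:
            walk(child, path)

    walk(root, [])
    return counts


doc = "<book><title/><chapter><para/><para/></chapter></book>"
paths = root_to_leaf_paths(doc)
# The distinct paths are the keys; the values are occurrence counts.
print(sorted(paths.items()))
```

The keys of the returned mapping correspond to the distinct paths of one document, and the values to how often each path occurs.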



5. (Original) The method as claimed in claim 4, further comprising the step of storing a path
dictionary ($Dict_{path} = \{p_1, p_2, \ldots, p_N\}$) of distinct paths collated from a tree representation for a
document.



6. (Original) The method as claimed in claim 5, further comprising the step of eliminating
selected paths from the path dictionary ($Dict_{path}$).
7. (Original) The method as claimed in claim 6, wherein paths that occur highly frequently
or highly infrequently are eliminated from the path dictionary ($Dict_{path}$).
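
The dictionary pruning of claims 5 to 7 can be sketched as follows. The claims do not say how "highly frequently" or "highly infrequently" is measured, so the document-frequency basis, the thresholds `min_docs` and `max_fraction`, and the function name `build_path_dictionary` are all assumptions chosen only for illustration.

```python
def build_path_dictionary(doc_paths, min_docs=2, max_fraction=0.9):
    """Collate distinct paths across documents and drop extreme ones.

    doc_paths: one mapping (path -> count) per document.
    min_docs and max_fraction are hypothetical thresholds.
    """
    n_docs = len(doc_paths)
    doc_freq = {}
    for paths in doc_paths:
        for p in set(paths):
            doc_freq[p] = doc_freq.get(p, 0) + 1
    # Keep paths that are neither too rare nor too common.
    return sorted(
        p for p, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_fraction
    )


docs = [
    {"a/b": 1, "a/c": 2},
    {"a/b": 3, "a/d": 1},
    {"a/b": 1, "a/c": 1},
]
# "a/b" occurs in every document and "a/d" in only one, so both are dropped.
print(build_path_dictionary(docs))
```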



8. (Original) The method as claimed in claim 7, further comprising the step of computing
the frequency of occurrence ($f_j(p_i)$) of a path in a document ($d_j$).



9. (Original) The method as claimed in claim 8, further comprising the step of computing
the maximum number of instances ($f_{max} = \max(f_j(p_i))$) in which a path ($p_i$) in the document ($d_j$)
occurs.



10. (Original) The method as claimed in claim 9, further comprising the step of storing a
representation of the document ($d_j$) as an $N$-dimensional vector ($\{d_{j1}, d_{j2}, \ldots, d_{jN}\}$, where $d_{jk} =
f_j(p_k)/f_{max}$, $1 \le k \le N$) of relative frequencies of occurrence ($f_j(p_k)$) of paths ($p_k$) in the document
($d_j$).
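
The vector representation of claims 8 to 10 might be computed along the following lines. This sketch assumes that each component is a path's count divided by the largest count among the dictionary paths of the same document; the function name `relative_frequency_vector` and the all-zero result for a document with no dictionary paths are assumptions of this example.

```python
def relative_frequency_vector(path_counts, dictionary):
    """Map one document to an N-dimensional vector of relative frequencies.

    path_counts: mapping path -> count for the document.
    dictionary: ordered list of the N dictionary paths.
    """
    freqs = [path_counts.get(p, 0) for p in dictionary]
    f_max = max(freqs, default=0)
    if f_max == 0:
        # Assumed convention: no dictionary path occurs in the document.
        return [0.0] * len(dictionary)
    return [f / f_max for f in freqs]


dictionary = ["a/b", "a/c", "a/d"]
vector = relative_frequency_vector({"a/b": 4, "a/c": 2}, dictionary)
print(vector)
```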



11. (Original) The method as claimed in claim 8, further comprising the step of computing
the minimum number of instances ($f_{min} = \min(f_j(p_i))$) in which a path ($p_i$) in the document ($d_j$)
occurs.



12. (Original) The method as claimed in claim 10, further comprising the step of computing
the similarity between a pair of documents ($d_i$, $d_j$) as a function ($sim(d_i, d_j)$) of metrics relating
the number of paths common to the respective documents ($d_i$, $d_j$).
13. (Original) The method as claimed in claim 12, wherein the function for computing the
similarity between a pair of documents ($d_i$, $d_j$)

$$sim(d_i, d_j) = \frac{\sum_{k=1}^{N} \min(d_{ik}, d_{jk})}{\sum_{k=1}^{N} \max(d_{ik}, d_{jk})}$$

is the quotient of a numerator, defined as the sum for all paths ($k = 1 \ldots N$) of the
minimum number of instances ($\min(d_{ik}, d_{jk})$) in which paths occur in the respective documents
($d_i$, $d_j$), and a denominator, defined as the sum for all paths ($k = 1 \ldots N$) of the maximum number
of instances ($\max(d_{ik}, d_{jk})$) in which paths occur in the respective documents ($d_i$, $d_j$).
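
The quotient described in claim 13 (sum of component-wise minima over sum of component-wise maxima) can be sketched as below. The function name `path_similarity` and the convention of returning 0.0 when both vectors are all zero are assumptions of this example, not part of the claim.

```python
def path_similarity(u, v):
    """Ratio of summed minima to summed maxima over two path vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    numerator = sum(min(a, b) for a, b in zip(u, v))
    denominator = sum(max(a, b) for a, b in zip(u, v))
    # Assumed convention: two empty or all-zero vectors score 0.0.
    return numerator / denominator if denominator else 0.0


d_i = [1.0, 0.5, 0.0]
d_j = [0.5, 0.5, 1.0]
print(path_similarity(d_i, d_j))
```

With non-negative components the result lies between 0 and 1, and it equals 1 only when the two vectors are identical and not all zero.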



14. (Original) The method as claimed in claim 1, wherein the tree representation of a
document includes a positional index, which represents, for a node (n), the number of previous
sibling nodes with the same label as that of node (n).



15. (Original) The method as claimed in claim 14, further comprising the step of storing as a
path representation a set that defines positional information of sibling nodes under a parent node.



16. (Original) The method as claimed in claim 15, further comprising the step of storing
precise path representations that precisely define a document structure, and generalised path
representations that partially generalise structural aspects of precise path representations of a
document.



17. (Original) The method as claimed in claim 16, wherein the step of calculating the
measure of similarity involves determining a total number of precise path representations of one
document that are either shared by the other document, or are a subsumed subset of at least one
of the generalised path representations of the other document.



18. (Original) The method as claimed in claim 17, further comprising the step of normalising
the measure of similarity by a term that represents the number of unique path representations 
shared by the two documents. 



19. (Original) The method as claimed in claim 18, wherein the number of unique path
representations is calculated by adding the number of path representations for each document,
and subtracting from this total the number of path representations shared by the two documents.
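
One possible reading of claims 17 to 19 is sketched below. It assumes a hypothetical path syntax of `label(index)` terms joined by `/`, with `(*)` marking a generalised index, and it assumes that the "shared" count in the normalising term means exact matches between the precise paths of the two documents. The names `subsumes` and `structural_similarity` are illustrative only.

```python
def subsumes(general, precise, sep="/"):
    """True if every term of `general` equals, or generalises with (*),
    the corresponding term of `precise`."""
    g_terms, p_terms = general.split(sep), precise.split(sep)
    if len(g_terms) != len(p_terms):
        return False
    for g, p in zip(g_terms, p_terms):
        if g == p:
            continue
        same_label = g[: g.index("(")] == p[: p.index("(")]
        if not (g.endswith("(*)") and same_label):
            return False
    return True


def structural_similarity(precise_a, precise_b, general_b):
    """Matched precise paths of A, normalised by the unique path count."""
    a, b = set(precise_a), set(precise_b)
    matched = sum(
        1 for p in a
        if p in b or any(subsumes(g, p) for g in general_b)
    )
    # Unique paths: paths of A plus paths of B minus those shared.
    unique = len(a) + len(b) - len(a & b)
    return matched / unique if unique else 0.0


doc_a = {"r(0)/x(0)", "r(0)/y(0)", "r(0)/y(1)"}
doc_b = {"r(0)/x(0)", "r(0)/y(0)"}
print(structural_similarity(doc_a, doc_b, {"r(0)/y(*)"}))
```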



20. (Original) The method as claimed in claim 14, further comprising the step of storing as a
path representation a sequence of terms separated by a delimiting symbol, in which each term is
represented by a label and a parenthesised predicate that specifies the positional index of the 
term either specifically or generally. 
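
The positional index of claim 14 and the term syntax of claim 20 could look like the following sketch. The `/` delimiter, the `label(index)` form, the `(*)` wildcard for a general index, and the function names `positional_paths` and `generalise_leaf` are assumptions for illustration; the claims do not fix a concrete syntax.

```python
import xml.etree.ElementTree as ET


def positional_paths(xml_text, sep="/"):
    """List root-to-leaf paths whose terms carry a positional index.

    The index of a node is the number of previous siblings with its label.
    """
    root = ET.fromstring(xml_text)
    out = []

    def walk(node, prefix, index):
        path = prefix + [f"{node.tag}({index})"]
        children = list(node)
        if not children:
            out.append(sep.join(path))
        seen = {}  # label -> number of siblings with that label seen so far
        for child in children:
            i = seen.get(child.tag, 0)
            seen[child.tag] = i + 1
            walk(child, path, i)

    walk(root, [], 0)
    return out


def generalise_leaf(path, sep="/"):
    """Replace the specific index of the last term with a wildcard."""
    head, _, leaf = path.rpartition(sep)
    general = leaf[: leaf.index("(")] + "(*)"
    return f"{head}{sep}{general}" if head else general


doc = "<book><title/><chapter><para/><para/></chapter></book>"
for p in positional_paths(doc):
    print(p, "->", generalise_leaf(p))
```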



21-22. (Cancelled).
23. (Currently Amended) A program storage device readable by computer, tangibly
embodying a program of instructions executable by said computer to perform a method for
determining a degree of similarity between documents in a given document collection, the
method comprising:

modeling all said documents as labeled tree representations;

building a computerized dictionary of path representations relating to paths that occur in
said documents;

storing, for at least two said documents, said labeled tree representations of respective
documents;

storing, for said at least two said documents, said path representations relating to said
paths that occur in the said documents from root nodes to leaf nodes in the said labeled tree
representations of the said respective documents; and

representing each of said documents in said document collection as an N-dimensional
vector comprising an element i denoting a value of a feature associated with a particular path,
wherein said feature comprises any of a presence or absence of said particular path in said
documents and a frequency of occurrence of said particular path in said documents;

calculating a measure of similarity between two of the documents based upon the
frequency of occurrence of similar paths specified by the path representations; and

using said measure of similarity to cluster a plurality of documents comprising similar
information, wherein said documents comprise any of web page documents and Extensible
Markup Language (XML) documents,

wherein two documents that differ only in the frequency of occurrence of the paths
associated with said two documents are considered to be more similar to each other than two
documents that differ in the occurrence of paths.



24. (Currently Amended) The program storage device in claim 23, wherein said method
further comprises the tree representation is a Document [[Model]] Object Model representation.



25. (Previously Presented) The program storage device in claim 23, wherein said method 
further comprises the step of generating a path representation for a path of a document as a 
sequence of labels representative from a root node to a leaf node in the labeled tree
representation of the document. 



26. (Previously Presented) The program storage device in claim 23, wherein said method
further comprises the step of storing, as path representations, sets of sequenced labels
representative of distinct paths in a labeled tree representation of a corresponding document.



[[28]] 27. (Currently Amended) The program storage device in claim 23, wherein the tree
representation of a document includes a positional index, which represents, for a node (n), the
number of previous sibling nodes with the same label as that of node (n).



[[29]] 28. (Currently Amended) A computer system operable for determining a degree of
similarity between documents in a given document collection, the computer system comprising:

10/629,133 8 
[[a first storage unit operable for storing labeled tree representations of respective
documents for at least two documents;]]

[[a second storage unit operable for storing, for at least two documents, path
representations relating to paths that occur in the documents from root nodes to leaf nodes in the
labeled tree representations of the respective documents; and]]

[[a calculator operable for calculating a measure of similarity between two of the
documents based upon the frequency of occurrence of similar paths specified by the path
representations;]]

means for modeling all said documents as labeled tree representations;

means for building a computerized dictionary of path representations relating to paths
that occur in said documents;

means for storing, for at least two said documents, said labeled tree representations of
respective documents;

means for storing, for said at least two said documents, said path representations relating
to said paths that occur in said documents from root nodes to leaf nodes in said labeled tree
representations of said respective documents;

means for representing each of said documents in said document collection as an N-
dimensional vector comprising an element i denoting a value of a feature associated with a
particular path, wherein said feature comprises any of a presence or absence of said particular
path in said documents and a frequency of occurrence of said particular path in said documents;

means for calculating a measure of similarity between two of the documents based upon
the frequency of occurrence of similar paths specified by the path representations;

means for using said measure of similarity to cluster a plurality of documents comprising
similar information, wherein said documents comprise any of web page documents and
Extensible Markup Language (XML) documents,

wherein two documents that differ only in the frequency of occurrence of the paths
associated with said two documents are considered to be more similar to each other than two
documents that differ in the occurrence of paths.



[[30]] 29. (Currently Amended) The computer system device in claim [[29]] 28, wherein [[said
method further comprises]] the tree [[representation]] representations is a Document [[Model]] Object
Model representation.



[[31]] 30. (Currently Amended) The computer system device in claim [[29]] 28, [[wherein said
method further comprises the step of]] further comprising means for generating a path
representation for a path of a document as a sequence of labels representative from a root node to
a leaf node in the labeled tree representation of the document.



[[32]] 31. (Currently Amended) The computer system device in claim [[29]] 28, [[wherein said
method further comprises the step of]] further comprising means for storing, as path
representations, sets of sequenced labels representative of distinct paths in a labeled tree
representation of a corresponding document.
[[33]] 32. (Currently Amended) The computer system device in claim [[29]] 28, wherein the tree
representation of a document includes a positional index, which represents, for a node (n), the
number of previous sibling nodes with the same label as that of node (n).